Fix KubernetesExecutor leaking a Manager process when reading running task logs#68800
Merged
Merged
Conversation
… task logs The API server constructs a KubernetesExecutor solely to call get_task_log() for RUNNING tasks (via FileTaskHandler -> ExecutorLoader.get_default_executor) and never starts or ends it. KubernetesExecutor.__init__ eagerly created a multiprocessing.Manager(), which forks a serve_forever child process. Because that instance is cached per process and never shut down, the Manager child is orphaned -- one leaked process (~350-400 MB resident) per API-server worker, growing with worker recycling and pushing the API server toward OOM. get_task_log() only needs the kube client and pod namespace; it never touches the task/result queues. Create the Manager and its queues lazily in start() (the scheduling loop is their only consumer), mirroring how LocalExecutor already defers process/queue creation. end() now no-ops when the executor was never started. Constructing the executor for log reading no longer spawns a Manager.
a3ebff1 to
1e9a18e
Compare
Contributor
|
I didn't review both PRs in detail, but there's another PR that also fixes the issue: #68697. |
amoghrajesh
approved these changes
Jun 30, 2026
amoghrajesh
left a comment
Contributor
There was a problem hiding this comment.
LGTM but CI needs fixing
shahar1
approved these changes
Jul 1, 2026
1 task
14 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When the API server serves logs for a RUNNING task instance,
FileTaskHandlerconstructs the configured executor to callget_task_log()and never starts or shuts it down.KubernetesExecutor.__init__eagerly created amultiprocessing.Manager(), which forks aserve_foreverchild process. Because the executor instance is cached per process, this leaks one orphanedManagerprocess (~350-400 MB resident) per API-server worker. With gunicorn/uvicorn worker recycling, each refreshed worker'sManagerreparents to PID 1 and is never reaped, so they accumulate and push the API server toward OOM.get_task_log()only needs the kube client and the pod namespace; it never touchestask_queue/result_queue/_manager. The Manager is purely a scheduling-loop resource.Closes #68693.
Fix
Create the
multiprocessing.Manager()and its queues lazily instart()instead of__init__(). The scheduling loop (start/execute_async/sync/end) is their only consumer, and the scheduler always callsstart()first.end()now no-ops when the executor was never started. Constructing the executor purely to read logs no longer spawns a Manager.Design rationale
start()and not a lazy property? The queues are only used after the scheduler callsstart(); nothing touches them between construction andstart(). Tying creation tostart()keeps the lifecycle explicit and pairs cleanly withend().LocalExecutor.LocalExecutoralready defers its queue/process creation tostart(), so it never leaked on this path. This change bringsKubernetesExecutorin line with that pattern, which also coversLocalKubernetesExecutor(its embeddedKubernetesExecutoris constructed in__init__and only started viaLocalKubernetesExecutor.start()).Affected path
FileTaskHandler._read()calls the executor'sget_task_logfor running tasks: file_task_handler.py#L637Backward compatibility
No public API change.
task_queue/result_queue/_managerareNoneuntilstart()is called; all internal consumers run afterstart(). Behavior for a started executor (the scheduler) is unchanged.